ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing

Authors

  • Nagiza F. Samatova
  • Isha Arkatkar
  • Rada Chirkova
  • Kemafor Anyanwu
Abstract

ARKATKAR, ISHA. ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing. (Under the direction of Nagiza F. Samatova.)

Analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require us to look for alternative ways of performing query-driven analyses. This thesis is a step toward query processing over losslessly compressed scientific data. We propose ALACRI2TY (Analytics-driven Lossless dAta Compression for Rapid In-situ Indexing, sToring, and querYing), which at its core consists of two components: a lossless compressor and a query processing engine that operates over the compressed data. ALACRI2TY's compression component compresses double-precision scientific data by unique-value-based binning. By splitting each value at its significant bits, ALACRI2TY improves compression ratios over general-purpose compression utilities. It then indexes the metadata about the compression, rather than the data itself, to keep index storage light-weight. The query processing engine answers range queries over this compressed data with a low degree of unnecessary decompression. ALACRI2TY's methodology involving compression and binning enables (1) indexing with a total storage requirement (data + index) of less than 135% (versus 200-300% in existing scientific database systems); (2) data access at multiple precision levels of detail, as necessitated by the varying sensitivity of analytical kernels (e.g., low precision for histograms and descriptive statistics, medium precision for clustering, and full precision for Fourier analysis); and (3) robust performance across univariate as well as multi-variate query constraints via efficient bitmap-based aggregation of partial results.
Altogether, these capabilities yield a multi-fold improvement in query response time over state-of-the-art systems such as FastBit, MonetDB, and SciDB when tested on several real-world data sets from scientific simulations, using the high-end compute clusters and Lustre file system at Oak Ridge National Laboratory.

© Copyright 2012 by Isha Arkatkar
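The abstract's description of significant-bit splitting, binning, and range queries can be illustrated with a minimal sketch. This is not the thesis implementation: the two-byte split point, the function names (`split_double`, `join_double`, `build_bins`, `bin_bounds`, `range_query`), and the in-memory dictionary layout are all assumptions made for illustration, and the sketch assumes finite input values. It only shows the general idea that the high-order bytes of a double identify a bin whose value range is known without touching the low-order bytes.

```python
import struct
from collections import defaultdict

HIGH_BYTES = 2                       # assumed split: sign, exponent, top 4 mantissa bits
LOW_BITS = 8 * (8 - HIGH_BYTES)      # remaining low-order mantissa bits
LOW_MASK = (1 << LOW_BITS) - 1

def split_double(x):
    """Split a float64 into (high-order bits, low-order bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> LOW_BITS, bits & LOW_MASK

def join_double(high, low):
    """Losslessly rebuild the float64 from its two parts."""
    return struct.unpack(">d", struct.pack(">Q", (high << LOW_BITS) | low))[0]

def build_bins(values):
    """Group values by their unique high-order bits (one bin per unique key)."""
    bins = defaultdict(list)
    for pos, x in enumerate(values):
        high, low = split_double(x)
        bins[high].append((pos, low))
    return dict(bins)

def bin_bounds(high):
    """Value range covered by a bin, derived from the key alone (finite inputs assumed)."""
    a, b = join_double(high, 0), join_double(high, LOW_MASK)
    return min(a, b), max(a, b)

def range_query(bins, lo, hi):
    """Return positions of values in [lo, hi]; only boundary bins are reconstructed."""
    hits = []
    for high, entries in bins.items():
        bmin, bmax = bin_bounds(high)
        if bmax < lo or bmin > hi:
            continue                                  # bin lies entirely outside the range
        if lo <= bmin and bmax <= hi:
            hits.extend(pos for pos, _ in entries)    # bin lies entirely inside: no rebuild
        else:
            hits.extend(pos for pos, low in entries
                        if lo <= join_double(high, low) <= hi)
    return sorted(hits)

values = [0.5, 1.25, 1.5, 3.0, -2.0, 100.0]
bins = build_bins(values)
print(range_query(bins, 1.0, 3.0))   # → [1, 2, 3]
```

In this sketch, bins that fall wholly inside or outside the query range are resolved from the bin key alone, which is one way to read the abstract's "low degree of unnecessary decompression". Reading only the bin keys would likewise give a low-precision view of the data, while joining in the low-order bits restores full precision.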

Similar resources

Analytics-Driven Lossless Data Compression for Rapid In-situ Indexing, Storing, and Querying

The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper is an attempt in the direction of query processing over losslessly compressed scientific data. We prop...

ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying

High-performance computing architectures face nontrivial data processing challenges, as computational and I/O components further diverge in performance trajectories. For scientific data analysis in particular, methods based on generating heavyweight access acceleration structures, e.g. indexes, are becoming less feasible for ever-increasing dataset sizes. We present ALACRITY, demonstrating the ...

Lossless Microarray Image Compression by Hardware Array Compactor

Microarray technology is a new and powerful tool for concurrent monitoring of a large number of gene expressions. Each microarray experiment produces hundreds of images. Each digital image requires a large amount of storage space. Hence, real-time processing and transmission of these images necessitate efficient and custom-made lossless compression schemes. In this paper, we offer a new archi...

Improving Compression Efficiency of Data Warehouse

Data compression has a major effect on data warehouses, reducing data size and improving query processing. Distinct compression techniques are feasible at different levels; each type either gives a good compression ratio or is suitable for query processing. This paper focuses on applying lossless and lossy compression techniques to relational databases. The proposed technique is used at att...

Factorized Databases: A Knowledge Compilation Perspective

This paper overviews recent work on compilation of relational queries into lossless factorized representations. The primary motivation for this compilation is to avoid redundancy in the representation of query results and speed up their computation and subsequent analytics.

Publication date: 2011